Wrapper Generation Supervised by a Noisy Crowd

Authors

Valter Crescenzi

Paolo Merialdo

Disheng Qiu

Abstract

We present crowdsourcing-based solutions to support the large-scale production of accurate wrappers for data-intensive websites. Our approach relies on supervised wrapper induction algorithms that delegate the burden of generating training data to the workers of a crowdsourcing platform. Workers are paid to answer simple membership queries chosen by the system. We present two algorithms: a single-worker algorithm (alfη) and a multiple-worker algorithm (alfred). Both algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries. alfred estimates the workers' error rates to decide at runtime how many workers are needed. Our experiments on real and synthetic data are encouraging: the approach produces accurate wrappers at low cost, even in the presence of workers with a significant error rate.
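
The idea of learning from noisy membership queries can be illustrated with a minimal sketch. It is not the authors' alfη or alfred algorithm: it assumes a small set of hypothetical candidate extraction rules, a known worker error rate `eta`, a plain Bayesian update, and a "closest to an even split" heuristic standing in for the active learning step. All names below (`update_posterior`, `most_informative_query`, `extracts`) are invented for illustration.

```python
from typing import Dict, Set

# Assumption: each candidate rule is identified by a name and is described
# by the set of page values it would extract.
Extracts = Dict[str, Set[str]]
Belief = Dict[str, float]


def update_posterior(prior: Belief, extracts: Extracts,
                     value: str, answer: bool, eta: float) -> Belief:
    """Bayes update after a worker answers 'is `value` a correct value?'.

    The worker is assumed to answer wrongly with probability `eta`.
    """
    posterior = {}
    for rule, p in prior.items():
        # A rule agrees with the answer if it extracts the value exactly
        # when the worker said "yes".
        agrees = (value in extracts[rule]) == answer
        posterior[rule] = p * ((1.0 - eta) if agrees else eta)
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}


def most_informative_query(prior: Belief, extracts: Extracts) -> str:
    """Pick the value whose expected 'yes' probability mass is closest to 0.5."""
    candidates = sorted(set().union(*extracts.values()))

    def yes_mass(value: str) -> float:
        return sum(p for rule, p in prior.items() if value in extracts[rule])

    return min(candidates, key=lambda v: abs(yes_mass(v) - 0.5))


if __name__ == "__main__":
    extracts = {"r1": {"a", "b"}, "r2": {"a", "c"}}
    belief = {"r1": 0.5, "r2": 0.5}
    query = most_informative_query(belief, extracts)
    belief = update_posterior(belief, extracts, query, answer=True, eta=0.1)
    print(query, belief)
```

In this toy setting, a value that every candidate rule extracts (here `"a"`) is never chosen as a query, because an answer about it cannot separate the rules. A higher `eta` makes each answer shift the belief less, which is the intuition behind asking more workers when responses are noisier.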

Similar resources

Semi-Automatic Wrapper Generation for Commercial Web Sources

Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. However, the wrapper generation techniques presented to date cannot properly handle sources that require complex navigational sequences to access data. In this paper, we present Wargo, a semi-automatic wrapper generation tool, which has been used by non-programmer...

Supervised Wrapper Generation with Lixto

We illustrate basic features of the Lixto wrapper generator such as the user and system interaction, the capacious visual interface, the marking and selecting procedures, and the extraction tasks by describing the construction of a simple example program in the current Lixto prototype.

Controlling the effect of crowd noisy annotations in NLP Tasks

Natural Language Processing (NLP) is a sub-field of Artificial Intelligence and Linguistics, with the aim of studying problems in the automatic generation and understanding of natural language. It involves identifying and exploiting linguistic rules and variation with code to translate unstructured language data into information with a schema. Empirical methods in NLP employ machine learning te...

DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present DEXTER, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our ...

A Supervised Visual Wrapper Generator for Web-Data Extraction

Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted and to specify mappings from an HTML page to the target schema. Based on...

Publication date: 2013
